Abstract


This report presents a rapid Byzantine-fault-tolerant 
self-stabilizing clock synchronization protocol that is independent of 
application-specific requirements. It is focused on clock 
synchronization of a system in the presence of Byzantine faults 
after the cause of any transient faults has dissipated. A model of 
this protocol is mechanically verified using the Symbolic Model 
Verifier (SMV) [SMV] where the entire state space is examined 
and proven to self-stabilize in the presence of one arbitrary faulty 
node. Instances of the protocol are proven to tolerate bursts of 
transient failures and deterministically converge in linear time 
with respect to the synchronization period. This protocol does not 
rely on assumptions about the initial state of the system, other than 
the presence of a sufficient number of good nodes. All timing 
measures of variables are based on the node's local clock, and no 
central clock or externally generated pulse is used. The Byzantine 
faulty behavior modeled here is a node with arbitrarily malicious 
behavior that is allowed to influence other nodes at every clock 
tick. The only constraint is that the interactions are restricted to 
defined interfaces. 
1. Introduction 


This report presents a clock synchronization protocol and the proof of its correctness for 
specific cases. For an introduction to the clock synchronization and self-stabilization problems, 
the reader is referred to the introductory sections in [Mai 2006A, 2006B, 2007, 2008]. 

A Byzantine-Fault-Tolerant Self-Stabilizing Protocol for Distributed Clock 
Synchronization Systems was reported in [Mai 2006A, 2006B, 2007, 2008]. Claims about the 
protocol were validated via mechanical verification of a system consisting of one Byzantine 
faulty node [Mai 2007, 2008]. Further analysis of the proofs revealed a potential simplification 
of this protocol. Having mechanically verified the protocol, it is now possible to explore 
variations of the protocol. What is presented here is a new protocol that is a direct result of this 
exploration and re-verification of the protocol reported in [Mai 2006A, 2006B, 2007, 2008]. 

The protocol in [Mai 2006A, 2006B, 2007] requires periodic transmission of Affirm 
messages to guarantee the presence and participation of all good nodes. Assuming that the good 
nodes are actively participating in the self-stabilization process, periodic transmission of Affirm 
messages can be inferred and, thus, periodic arrival of Affirm messages can be assumed. As a 
result, transmission of the Affirm messages can be eliminated. Therefore, only one 
self-stabilization message, Resync, suffices. Nevertheless, it is worth noting that periodic 
transmission of the Affirm messages by the good nodes not only reduces the error detection time 
but also expedites the reintegration process. In [Mai 2006A, 2006B, 2007, 2008] an accept event 
counter was introduced to account for the arrival and accumulation of Affirm messages. 
Assuming the presence of periodic Affirm messages, the corresponding behavior of the protocol 
is now compensated by keeping track of elapsed time. 

This report extends the results of the basic case [Mai 2007, 2008] to larger systems and 
examines synchronization of a system of K ≥ 3F + 1 nodes in the presence of multiple Byzantine 
faulty nodes. Analysis of larger systems revealed that a direct generalization of the results of the 
basic case is not applicable to larger systems. Although this protocol solves the basic case, it 
does not synchronize a system of K ≥ 3F + 1 nodes in the presence of F Byzantine faulty nodes 
when F > 1. 

The proof presented here applies to all instances and applications of this protocol and to 
those reported in [Mai 2006A, 2006B, 2007, 2008]. Although this proof parallels that of [Mai 
2006A, 2006B, 2007, 2008], the proof is redone and restructured to make it easier to follow, 
simpler to analyze, and, thus, easier to comprehend. Elimination of the Affirm messages resulted 
in a reduction of the number of parameters and, hence, the initial state space. The mechanical 
verification of the protocol presented in this report is now more manageable and can be 
conducted in a shorter amount of time on computers with less memory. Furthermore, if more 
memory and computing power were available, larger and more complex systems could be 
analyzed. Also, in the absence of periodic transmission of Affirm messages, implementation of 
the implicit fault model [Mai 2007] of the faulty nodes in the mechanical verification models is 
now practical. 

In this report, a rapid Byzantine self-stabilizing clock synchronization protocol is 
presented. Specific cases of this protocol are demonstrated to self-stabilize from any state, 
tolerate bursts of transient failures, and deterministically converge within a linear time with 
respect to the synchronization period. Upon stabilization, all good clocks proceed 
synchronously. 


2. System Overview 

The underlying topology considered here is a network of K ≥ 3F + 1 nodes that exchange 
messages through a set of communication channels. A maximum of F Byzantine faulty nodes 
are assumed to be present in the system, where F ≥ 0. The communication channels are assumed 
to connect a set of source nodes to a set of destination nodes such that the source of a given 
message is uniquely identifiable from other sources of messages. The minimum number of good 
nodes in the system, G, is defined by G = K - F nodes. The nodes communicate with each other 
by exchanging broadcast messages. Broadcast of a message to all other nodes is realized by 
transmitting the message to all other nodes at the same time. The communication network does 
not guarantee any relative order of arrival of a broadcast message at the receiving nodes, and a 
consistent delivery order of a set of messages does not necessarily reflect the temporal or causal 
order of the message transmissions [Kop 1997]. 
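
The sizing relations above, K ≥ 3F + 1 and G = K - F, can be illustrated with a short sketch. The function names are hypothetical and are not part of the report; the code only restates the two relations.

```python
def max_byzantine_faults(k: int) -> int:
    """Largest F that satisfies K >= 3F + 1 for a system of K nodes."""
    if k < 1:
        raise ValueError("the system needs at least one node")
    return (k - 1) // 3


def min_good_nodes(k: int, f: int) -> int:
    """G = K - F, meaningful only when F >= 0 and K >= 3F + 1."""
    if f < 0 or k < 3 * f + 1:
        raise ValueError("requires F >= 0 and K >= 3F + 1")
    return k - f


if __name__ == "__main__":
    for k in (4, 7, 10):
        f = max_byzantine_faults(k)
        print(f"K={k}: F<={f}, G>={min_good_nodes(k, f)}")
```

For example, the basic case of four nodes tolerates one Byzantine faulty node and leaves at least three good nodes.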

Each node is driven by an independent local physical oscillator with one oscillation 
representing a local clock tick. The oscillators of good nodes have a known bounded drift rate, 
0 ≤ ρ ≪ 1, with respect to real time. 

Each node has two primary logical time clocks, StateTimer and LocalTimer, which 
locally keep track of the passage of time as indicated by the local clock tick. There is neither a 
central system clock nor an externally generated global pulse. 

The faulty communication channels and nodes can behave arbitrarily provided that 
eventually the system adheres to the protocol assumptions (see Section 3.5). 

The communication latency between the nodes is expressed in terms of the minimum 
event-response delay, D, and network imprecision, d. These parameters are described with the 
help of Figure 1. As depicted, a message transmitted at real time t0 is expected to arrive at all 
destination good nodes, be processed, and subsequent messages generated within the time 
interval of [t0 + D, t0 + D + d]. Communication between independently clocked nodes is 
inherently imprecise. The network imprecision, d, is the maximum time difference among all 
good receivers, Nj, of a message from good node Ni with respect to real time. The imprecision is 
due to the drift of the clocks with respect to real time, jitter, discretization error, temperature 
effects, and differences in the lengths of the physical communication media. These two 
parameters are assumed to be bounded such that D ≥ 1 and d ≥ 0, and both have values with units 
of nominal clock tick. 

Figure 1. Event-response delay, D, and network imprecision, d. 

2.1. Gamma (γ) 

The timeline of activities of a node is partitioned into a sequence of equal duration 
intervals from the time the node transitioned to a new state. The duration of these intervals, 
denoted γ, is expressed in terms of D and d, constrained such that γ ≥ (D + d), and measured by 
the local oscillator. Unless stated otherwise, all time-dependent parameters of this protocol are 
measured locally and expressed as functions of γ. The time-driven activities of a node take place 
at equal intervals measured by the local oscillator since the node entered a new state. In contrast, 
event-driven activities are independent of γ and, thus, take place immediately. 
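
As a small worked illustration of the timing parameters, the sketch below computes the response window [t0 + D, t0 + D + d] and the smallest interval length allowed by γ ≥ (D + d). The function and parameter names are hypothetical, and all quantities are assumed to be whole numbers of nominal clock ticks.

```python
def response_window(t0: int, big_d: int, small_d: int) -> tuple[int, int]:
    """Interval [t0 + D, t0 + D + d] in which a message sent at t0 is handled."""
    if big_d < 1 or small_d < 0:
        raise ValueError("requires D >= 1 and d >= 0")
    return (t0 + big_d, t0 + big_d + small_d)


def min_gamma(big_d: int, small_d: int) -> int:
    """Smallest gamma, in clock ticks, allowed by gamma >= D + d."""
    if big_d < 1 or small_d < 0:
        raise ValueError("requires D >= 1 and d >= 0")
    return big_d + small_d
```

With D = 2 and d = 1, a message sent at tick 10 is handled between ticks 12 and 13, and γ must be at least 3 ticks.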

3. Protocol Description 

The system is in the steady state when it is stabilized. In order to achieve stabilization, 
the nodes communicate by exchanging a Sync message. A Sync message is transmitted either as 
a result of a resynchronization timeout, or when a node determines that a sufficient number of 
other nodes have engaged in the resynchronization process. 

Three fundamental parameters characterize the self-stabilization protocol presented 
here, namely K, D, and d. The maximum number of faulty nodes, F, the minimum number of 
good nodes, G, the γ intervals, and the remaining parameters that are subsequently presented are 
derived parameters based on the fundamental parameters. 

3.1. Message Validity 

Only one message is required for the operation of the protocol. Receiving a Sync 
message is indicative of its validity in the value domain. The protocol performs as intended 
when the timing requirements of the received messages from all good nodes at all other good 
nodes are satisfied. The time interval between any two consecutive Sync messages from a node 
is denoted ΔSS, and the shortest such interval for a good node is denoted ΔSS,min. The following 
definitions apply at the receiving nodes. 

• A Sync message from a given source is valid if it arrives at or after ΔSS,min of an 
immediately preceding Sync message that is valid in the value domain. 

• While in the Maintain state, a Sync message from a given source remains valid for the 
duration of that state. 

• While in the Restore state, a Sync message from a given source remains valid for the 
duration of one γ. 





3.2. The Monitor 


The messages to be delivered to the destination nodes are deposited on communication 
channels. A node consists of a state machine and a set of monitors. To assess the behavior of 
other nodes, a node employs (K - 1) monitors, with one monitor for each source of incoming 
messages, as shown in Figure 2. A node neither uses nor monitors its own messages. The 
distributed observation of other nodes localizes error detection of incoming messages to their 
corresponding monitors, and allows for modularization and distribution of the self-stabilization 
protocol process within a node. A monitor keeps track of the activities of its corresponding 
source node. Specifically, a monitor reads, evaluates, time-stamps, validates, and stores the last 
message it receives from that node. A monitor maintains a logical timer, MessageTimer, by 
incrementing it once per local clock tick. This timer is reset upon receiving a Sync message. A 
monitor also disposes of retained valid messages as indicated by the protocol (Section 4). 
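
The monitor's bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the class name `Monitor`, its method and attribute names, the initial timer value, and the choice to expire a stored message exactly when it has been held for γ ticks are all assumptions. The arguments `delta_ss_min` and `gamma` stand for ΔSS,min and γ in local clock ticks.

```python
class Monitor:
    """Tracks the last Sync message from one source node (illustrative sketch)."""

    def __init__(self, delta_ss_min: int, gamma: int) -> None:
        self.delta_ss_min = delta_ss_min   # shortest legal gap between Syncs, in ticks
        self.gamma = gamma                 # length of one gamma interval, in ticks
        self.message_timer = delta_ss_min  # MessageTimer; initial value is an assumption
        self.has_valid_sync = False        # True while a valid Sync is stored

    def tick(self, sync_received: bool, host_in_restore: bool) -> None:
        """Advance one local clock tick and process at most one incoming Sync."""
        self.message_timer += 1
        if sync_received:
            # Valid only if it arrives at or after delta_ss_min ticks from
            # the preceding Sync of the same source.
            self.has_valid_sync = self.message_timer >= self.delta_ss_min
            self.message_timer = 0         # reset on every received Sync
        elif (host_in_restore and self.has_valid_sync
              and self.message_timer >= self.gamma):
            # In the Restore state a stored Sync is kept for one gamma only.
            self.has_valid_sync = False
```

A node would hold one such object per source node and call `tick` once per local clock tick.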


Figure 2. The i-th node, Ni, with its monitors and state machine. 


3.3. The State Machine 

The assessment results of the monitored nodes are utilized by the node in the 
self-stabilization process. The state machine has two states, Restore (R) and Maintain (M), that 
reflect the current state of the node in the system as shown in Figure 3. The state machine 
triggers a Sync message broadcast when it transitions from the Maintain state to the Restore 
state. The state machine describes the behavior of the node, Ni, utilizing assessment results from 
its monitors, M1 .. Mi-1, Mi+1 .. MK as shown in Figure 2, where Mj is the monitor for the 
corresponding node Nj. In addition to the behavior of its corresponding source node, a monitor's 
internal status is influenced by the current state of the node’s state machine. When the state 
machine transitions to the Restore state, the monitors update their internal status as appropriate 
(Section 3.4). 

Figure 3. The node state machine. 

The transitory conditions enable the node to migrate from the Restore state to the 
Maintain state. Although during the self-stabilization process a node may transition from the 
Restore state to the Maintain state upon a timeout, during steady state such a timeout is 
indicative of an abnormal situation. The transitory conditions are defined with respect to the 
steady state where such timeouts do not occur. The transitory delay is the length of time a node 
stays in the Restore state. The minimum required duration for the transitory delay is denoted by 
TDmin, and the maximum duration by TDmax. TDmin is a derived parameter and a function of F. 
For the fully connected topology considered here, the transitory conditions are defined as 
follows. 

1. The node has remained in the Restore state for at least TDmin, where 

TDmin = 2, for F = 0, or 
TDmin = 2F, for F > 0, and 

2. One γ has passed since the arrival of the last valid Sync message. 
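
The two transitory conditions can be expressed as a small predicate. This is a sketch only: the function and parameter names are hypothetical, and both durations are assumed to be counted in whole γ intervals.

```python
def td_min(f: int) -> int:
    """Minimum transitory delay in units of gamma: 2 for F = 0, else 2F."""
    if f < 0:
        raise ValueError("F must be non-negative")
    return 2 if f == 0 else 2 * f


def transitory_conditions_met(gammas_in_restore: int,
                              gammas_since_last_valid_sync: int,
                              f: int) -> bool:
    """True when both transitory conditions hold."""
    stayed_long_enough = gammas_in_restore >= td_min(f)     # condition 1
    quiet_for_one_gamma = gammas_since_last_valid_sync >= 1  # condition 2
    return stayed_long_enough and quiet_for_one_gamma
```

For F = 1, a node that has spent two γ in the Restore state and has seen no valid Sync message for one γ satisfies both conditions.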

The maximum duration of the transitory delay, denoted TDmax, after meeting the TDmin 
requirement depends on the number of additional valid Sync messages received and the drift rate 
ρ. The upper bound for TDmax during steady state will be determined later in this report. 

In the Restore state, the node will either meet the transitory conditions and transition to 
the Maintain state, or remain in the Restore state for a predetermined maximum duration until it 
times out and then transitions to the Maintain state. In the Maintain state, a node will either 
transition to the Restore state when at least Tr other nodes have transitioned out of the Maintain 
state as indicated by the reception of at least Tr valid Sync messages, or remain in the Maintain 
state for a predetermined maximum duration until it times out and transitions to the Restore state. 
The derived parameter Tr is defined as Tr = F + 1 and is used as a threshold in conjunction with 
the Sync messages. 
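
The role of the threshold Tr = F + 1 can be shown with a short sketch. With at most F faulty nodes, F + 1 valid Sync messages from distinct sources must include at least one from a good node. The function name and the list-of-flags representation (one flag per monitor) are assumptions made for illustration.

```python
def retry_threshold_reached(valid_sync_flags: list[bool], f: int) -> bool:
    """True when at least Tr = F + 1 monitors hold a valid Sync message."""
    t_r = f + 1
    return sum(1 for flag in valid_sync_flags if flag) >= t_r
```

In a four-node system with F = 1, a node has three monitors, and two valid Sync messages are enough to leave the Maintain state.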

In Figure 4 the transitions of a good node to the Restore state and from the Restore state 
to the Maintain state (during steady state) are depicted along a timeline of node activities. A 
Sync message is transmitted as the node transitions from the Maintain state to the Restore state. 
Activities of the StateTimer and LocalTimer of the node as it transitions between different states 
are also depicted in this figure. 

Figure 4. Activities of a good node during steady state. 

The clocks need to be periodically synchronized due to their inherent drift with respect to 
each other. The periodic synchronization during steady state is referred to as the 
resynchronization process, whereby all good nodes transition to the Restore state and then 
synchronously to the Maintain state. The resynchronization process begins when the first good 
node transitions to the Restore state and ends after the last good node transitions to the Maintain 
state. An upper bound on the duration of the resynchronization process will be determined later 
in this report. 

The synchronization period during steady state, denoted P, is defined as the largest time 
interval between two consecutive resets of the LocalTimer by a good node. The synchronization 
period depends on the maximum duration of both states of the state machine. The maximum 
duration for the Restore state is denoted by Pr, and the maximum duration for the Maintain state 
is denoted by Pm, where Pr and Pm are expressed in terms of γ. The length of time that a good 
node stays in the Restore state is denoted by Lr. During steady state Lr is always less than Pr. 
The length of time a good node stays in the Maintain state is denoted by Lm. The 
synchronization period, P, is defined by P = Pr + Pm and is expressed in terms of γ. The actual 
synchronization period, PActual, is the time interval (during steady state) between the last two 
consecutive resets of the LocalTimer of a good node, where PActual = Lr + Lm < P. 

A node keeps track of time by incrementing its logical time clock StateTimer once every 
γ. After the StateTimer reaches Pr or Pm, depending on the current state of the node, the node 
times out, resets the StateTimer, and transitions to the other state. If the node was in the 
Maintain state, it transmits a new Sync message. The current value of this timer reflects the 
duration of the current state of the node. 

This protocol is expected to be used as the fundamental mechanism to bring and maintain 
a system within a known synchronization precision bound. Therefore, the protocol has to 
properly filter out inherent oscillations in the StateTimer during the resynchronization process as 
depicted in Figure 4. This is resolved by using the LocalTimer in the protocol. The LocalTimer 
is intended to be used by higher level protocols and must be managed properly to provide the 
desired monotonically increasing value between adjustments. The logical time clock LocalTimer 
is incremented once every local clock tick and is reset either when it reaches its maximum 
allowed value or when the node has transitioned to the Maintain state and remained in that state 
for ResetLocalTimerAt local clock ticks, where ResetLocalTimerAt is constrained by the 
following inequality: 

$$\lceil \Delta_{Precision} / \gamma \rceil \le ResetLocalTimerAt \le P_M - \lceil \Delta_{Precision} / \gamma \rceil \qquad (1)$$

The synchronization precision, denoted ΔPrecision, is the guaranteed upper bound on the maximum 
separation between the LocalTimers of any two good nodes. The ResetLocalTimerAt can be 
given any value in the range specified in inequality (1). However, the value must be the same at 
all good nodes. In this inequality, the lower bound indicates when all good nodes have transitioned 
to the Maintain state and the upper bound indicates when the first node might transition out of 
the Maintain state. We choose the earliest such value, ResetLocalTimerAt = ⌈ΔPrecision / γ⌉, to reset 
the LocalTimer of all good nodes. Any value greater than ⌈ΔPrecision / γ⌉ will prolong the 
convergence time. 
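
The admissible range for ResetLocalTimerAt can be computed as in the sketch below. It assumes that inequality (1) reads ⌈ΔPrecision / γ⌉ ≤ ResetLocalTimerAt ≤ Pm - ⌈ΔPrecision / γ⌉, with Pm given in units of γ. The function names are hypothetical.

```python
import math


def reset_local_timer_bounds(delta_precision: float, gamma: float,
                             p_m: int) -> tuple[int, int]:
    """Lower and upper bounds of inequality (1), in units of gamma."""
    lower = math.ceil(delta_precision / gamma)
    upper = p_m - lower
    if lower > upper:
        raise ValueError("Pm is too small for the given precision")
    return lower, upper


def choose_reset_local_timer_at(delta_precision: float, gamma: float,
                                p_m: int) -> int:
    """Earliest admissible value, the choice made in the text."""
    lower, _ = reset_local_timer_bounds(delta_precision, gamma, p_m)
    return lower
```

For instance, with ΔPrecision = 10 ticks, γ = 4 ticks, and Pm = 20, the bounds are 3 and 17, and the earliest choice is 3.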

The LocalTimer is also used in assessing the state of the system in the resynchronization 
process and is bounded by P·γ. During steady state, the value of LocalTimer is always less than 
P·γ. 


3.4. Protocol Functions 

The functions used in the protocol are described in this section. 

The function InvalidSync() is used by the monitors. This function determines whether a 
received Sync message is invalid. When this function returns a true value, it indicates that an 
unexpected behavior by the corresponding source node has been detected. 

The function ConsumeMessage( ) is used by the monitors. When the host node is in the 
Restore state, the monitor invalidates the stored Sync message after it has been kept for one γ. 

The Retry() function determines if at least Tr other nodes have transitioned out of the 
Maintain state, where Tr = F + 1. When at least Tr valid Sync messages from as many nodes have 
been received, this function returns a true value indicating that at least one good node has 
transitioned to the Restore state. This function is used to transition from the Maintain state to the 
Restore state. 

The TransitoryConditionsMet() function determines proper timing of the transition from 
the Restore state to the Maintain state. This function keeps track of the passage of time by 
monitoring the StateTimer and determines if the node has been in the Restore state for at least 
TDmin. It returns a true value when the transitory conditions are met. 

The TimeOutRestore() function asserts a timeout condition when the value of the 
StateTimer has reached Pr in the Restore state. Such a timeout triggers the node to transition to 
the Maintain state. 

The TimeOutMaintain() function asserts a timeout condition when the value of the 
StateTimer has reached Pm in the Maintain state. Such a timeout triggers the node to reengage in 
another round of the resynchronization process. 

In addition to the above functions, the state machine utilizes the TimeOutGammaTimer( ) 
function, which is used to regulate node activities at the γ boundaries. It maintains a 
GammaTimer by incrementing it once per local clock tick. Once the value of the GammaTimer 
reaches γ, it is reset and the function returns a true value. 
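
The behavior described for TimeOutGammaTimer() can be sketched as a small counter. The class and method names are hypothetical, and γ is assumed to be a whole number of local clock ticks.

```python
class GammaTimer:
    """Counts local clock ticks and signals each gamma boundary."""

    def __init__(self, gamma: int) -> None:
        if gamma < 1:
            raise ValueError("gamma must be at least one tick")
        self.gamma = gamma
        self.value = 0

    def time_out(self) -> bool:
        """Call once per local clock tick; True when gamma ticks have elapsed."""
        self.value += 1
        if self.value >= self.gamma:
            self.value = 0   # reset at the gamma boundary
            return True
        return False
```

With γ = 3, the call returns a true value on every third tick.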

3.5. Protocol Assumptions 

The protocol assumptions are as follows. 

1. The cause of transient faults has dissipated. 

2. At most F of the nodes remain faulty. 

3. All good nodes correctly execute the protocol. 

4. The source of a message is uniquely identifiable by the receivers. 

5. A message sent by a good node will be received and processed by all other good nodes 
within γ, where γ ≥ (D + d). 

6. The initial values of the variables of a node can be set to arbitrary values within their 
corresponding range. (In an implementation, it is expected that some local mechanism 
exists to enforce type consistency for all variables.) 

3.6. The Self-Stabilizing Clock Synchronization Problem 

To simplify the presentation of this protocol, it is assumed that all time references are 
with respect to an initial real time t0, where t0 = 0 when the protocol assumptions are satisfied, 
and for all t ≥ t0 the system operates within the protocol assumptions. 

We define the following symbols: 

• C denotes a bound on the maximum convergence time, 

• ΔLocalTimer(t), for real time t, is the maximum difference of values of the LocalTimers of 
any two good nodes, and 

• ΔPrecision, the synchronization precision, is the guaranteed upper bound on ΔLocalTimer(t). 

The maximum difference in the value of LocalTimer for all pairs of good nodes at time t, 
ΔLocalTimer(t), is determined by the following equation while accounting for the variations in the 
values of the LocalTimer across all good nodes. 

$$\Delta_{LocalTimer}(t) = \min\big( LocalTimer_{max}(t) - LocalTimer_{min}(t),\; LocalTimer_{max}(t - r) - LocalTimer_{min}(t - r) \big),$$

where 

$$r = \lceil \Delta_{Precision} / \gamma \rceil,$$

$$LocalTimer_{min}(x) = \min_i \big( LocalTimer_i(x) \big),$$

$$LocalTimer_{max}(x) = \max_i \big( LocalTimer_i(x) \big).$$
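
The definition of ΔLocalTimer(t) can be illustrated with a short sketch. It assumes that a history of LocalTimer samples of the good nodes is available, keyed by time; that data structure and the function name are hypothetical. Taking the smaller of the two spreads means that a momentary large difference, such as when only some nodes have just reset their LocalTimer, is not counted if the timers were close r ticks earlier.

```python
def delta_local_timer(history: dict[int, list[int]], t: int, r: int) -> int:
    """Smaller of the LocalTimer spreads at time t and at time t - r.

    history[x] holds the LocalTimer values of all good nodes at time x.
    """
    def spread(x: int) -> int:
        values = history[x]
        return max(values) - min(values)

    return min(spread(t), spread(t - r))
```

For example, if the good nodes read 5, 6, 7 at time 0 and 0, 8, 9 at time 3 (one node has just reset), then with r = 3 the result is 2 rather than 9.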

There exist C and ΔPrecision such that the following self-stabilization properties hold. 

1. Convergence: ΔLocalTimer(C) ≤ ΔPrecision 

2. Closure: ∀ t ≥ C, ΔLocalTimer(t) ≤ ΔPrecision 

3. Congruence: ∀ good nodes Ni and Nj, ∀ t ≥ C, LocalTimer_i(t) = 0 ⇒ 
Ni and Nj are in the Maintain state. 

The values of ΔSS,min, C, ΔPrecision, and the maximum value for LocalTimer, P, are determined to 
be: 

$$\Delta_{SS,min} = TD_{min} \cdot \gamma + 1,$$

$$C = (2 P_R + P_M) \cdot \gamma,$$

$$\Delta_{Precision} = (3F - 1) \cdot \gamma - D + \Delta_{Drift},$$

$$P = P_R + P_M,$$

$$P_M \gg P_R,$$

where the amount of drift from the initial precision is given by 

$$\Delta_{Drift} = \big( (1 + \rho) - 1/(1 + \rho) \big) \cdot P \cdot \gamma.$$

Note that since P > ViPp and the LocalTimer is reset after reaching P (worst case wraparound), a 
trivial solution is not possible. 
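
The derived quantities can be evaluated numerically with the sketch below. It assumes the expressions read ΔSS,min = TDmin·γ + 1, C = (2Pr + Pm)·γ, ΔPrecision = (3F - 1)·γ - D + ΔDrift, and ΔDrift = ((1 + ρ) - 1/(1 + ρ))·P·γ. The function names and the sample values are illustrative only.

```python
def delta_ss_min(td_min: int, gamma: int) -> int:
    """Shortest gap between Sync messages of a good node, in clock ticks."""
    return td_min * gamma + 1


def convergence_bound(p_r: int, p_m: int, gamma: int) -> int:
    """C = (2*Pr + Pm) * gamma, in clock ticks."""
    return (2 * p_r + p_m) * gamma


def delta_drift(rho: float, p: int, gamma: int) -> float:
    """Drift accumulated over one synchronization period P."""
    return ((1 + rho) - 1 / (1 + rho)) * p * gamma


def delta_precision(f: int, gamma: int, big_d: int, rho: float, p: int) -> float:
    """Synchronization precision, in clock ticks."""
    return (3 * f - 1) * gamma - big_d + delta_drift(rho, p, gamma)


if __name__ == "__main__":
    # Sample values chosen for illustration: F = 1, D = 2, gamma = 3.
    print(delta_ss_min(td_min=2, gamma=3))
    print(convergence_bound(p_r=10, p_m=20, gamma=3))
    print(delta_precision(f=1, gamma=3, big_d=2, rho=1e-6, p=30))
```

Because ρ ≪ 1, the drift term stays small relative to (3F - 1)·γ - D unless P is very large.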

4. The Clock Synchronization Protocol 

The protocol presented in Figure 5 consists of a state machine and a set of monitors 
which execute once every local oscillator tick. 


Monitor: 

case (incoming message from the corresponding node) 
{Resync: 
    if InvalidSync( ) then 
        Invalidate the message 
    else 
        Validate and store the message. 

Other: 
    Do nothing. 
} // case 

ConsumeMessage( ) 


Node: 

case (state of the node) 
{Restore: 
    if TimeOutRestore( ) then 
        Reset StateTimer, 
        Go to Maintain state, † 
    else 
        if TransitoryConditionsMet( ) then 
            Reset StateTimer, 
            Go to Maintain state. 
        else 
            Stay in Restore state. 

Maintain: 
    if TimeOutMaintain( ) or Retry( ) then 
        Transmit Sync message. 
        Reset StateTimer, 
        Go to Restore state, 
    elseif TimeOutGammaTimer( ) then 
        if (StateTimer = ⌈ΔPrecision / γ⌉) 
            Reset LocalTimer., 
        Stay in Maintain state, 
    else 
        Stay in Maintain state. 
} // case 


Figure 5. The self-stabilization protocol. 

† In [Mai 2006A, 2006B, 2007, 2008], upon TimeOutRestore(), the node transmitted a Sync 
message and remained in the Restore state. The modification introduced here simplifies the 
proof argument and does not change the properties of the protocol. 

Semantics of the pseudo-code 

• Indentation is used to show a block of sequential statements. 

• ‘,’ is used to separate sequential statements. 

• ‘.’ is used to end a statement. 

• ‘.,’ is used to mark the end of a statement and at the same time to separate it from other 
sequential statements. 
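
The node portion of Figure 5 can be rendered as a runnable sketch. This is one possible reading and not the verified SMV model: the class and attribute names are hypothetical, the two Boolean arguments stand in for the results of Retry() and TransitoryConditionsMet(), the default initial values are illustrative (the protocol permits arbitrary initial values), and restarting the γ count at each transition follows the description in Section 2.1.

```python
from dataclasses import dataclass

RESTORE, MAINTAIN = "Restore", "Maintain"


@dataclass
class Node:
    gamma: int                  # local clock ticks per gamma interval
    p_r: int                    # Pr: maximum Restore duration, in gamma
    p_m: int                    # Pm: maximum Maintain duration, in gamma
    reset_local_timer_at: int   # ResetLocalTimerAt, in gamma
    state: str = RESTORE
    state_timer: int = 0        # StateTimer: gamma intervals in this state
    gamma_timer: int = 0        # GammaTimer: ticks into the current gamma
    local_timer: int = 0        # LocalTimer: ticks, for higher-level protocols

    def _enter(self, new_state: str) -> None:
        self.state = new_state
        self.state_timer = 0
        self.gamma_timer = 0    # gamma intervals restart at each transition

    def step(self, retry: bool, transitory_conditions_met: bool) -> bool:
        """Advance one local clock tick; return True if a Sync is transmitted."""
        self.local_timer += 1
        if self.local_timer >= (self.p_r + self.p_m) * self.gamma:
            self.local_timer = 0            # wraparound at P * gamma
        self.gamma_timer += 1
        gamma_elapsed = self.gamma_timer >= self.gamma
        if gamma_elapsed:
            self.gamma_timer = 0
            self.state_timer += 1

        if self.state == RESTORE:
            # Timeout or transitory conditions lead to the Maintain state.
            if self.state_timer >= self.p_r or transitory_conditions_met:
                self._enter(MAINTAIN)
            return False

        # Maintain state: timeout or Retry leads back to the Restore state.
        if self.state_timer >= self.p_m or retry:
            self._enter(RESTORE)
            return True                     # Sync is sent on leaving Maintain
        if gamma_elapsed and self.state_timer == self.reset_local_timer_at:
            self.local_timer = 0
        return False
```

Calling `step` once per local clock tick, with the two flags computed from the monitors, drives the node between the Restore and Maintain states.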


5. Proof 

The lemmas and theorems are presented in this section. The proof approach is to show 
that a system of K ≥ 3F + 1 nodes asynchronously converges from any (un-stabilized) condition 
to a condition where all good nodes are in the Restore state and then synchronously transition to 
the Maintain state within a guaranteed initial precision (stabilized). The system is then shown to 
remain within the timing bounds of the synchronization precision ΔPrecision. This idea is depicted 
in Figure 6. 



Figure 6. The proof idea. 

To achieve this goal, first, a good node is shown to transition from the Restore state to the 
Maintain state and vice versa infinitely often. Second, the analysis consists of three possible 
scenarios where none, some, or all good nodes are in the Maintain state. Third, the good nodes 
are shown to transition to the Restore state after the elapse of some time and then synchronously 
to the Maintain state. Finally, the system is shown to transition between these two states 
infinitely often while preserving the synchronization precision. 

Since the oscillator drift rate, ρ, does not play a significant role in the convergence 
process, it is omitted from the expressions regarding parameters, constants, equations, and the 
proofs of convergence. However, ρ does affect the closure property and is included in 
expressions regarding ΔPrecision. Omission of ρ does not change the behavior of the protocol or 
the validity of the proofs. The effect of ρ is later visited to show that its omission in the 
convergence process as well as the resynchronization process is justified. 

Throughout the proofs, the protocol assumptions apply and unless stated otherwise, all 
references to the Sync messages are with respect to valid Sync messages. 

A node behaves properly if it correctly executes the protocol. 

Lemma MaintainWithinPR - A good node in the Restore state transitions to the Maintain state 
within at most Pr. 

Proof - It follows from the protocol that a node in the Restore state will transition to the 
Maintain state either after meeting the transitory conditions as expressed in function 
TransitoryConditionsMet(), or because of a resynchronization timeout, as expressed in function 
TimeOutRestore(). Therefore, a node transitions to the Maintain state within at most Pr. ♦ 

Lemma RestoreWithinPM - A good node in the Maintain state transitions to the Restore state 
within at most Pm. 

Proof - It follows from the protocol that a node in the Maintain state will transition to the 
Restore state either because of a resynchronization timeout, as expressed in function 
TimeOutMaintain( ), or when at least Tr other nodes have transitioned out of the Maintain state, 
as expressed in function Retry(). Since the longest such time interval is bounded by the timeout, 
the node transitions to the Restore state and transmits a Sync message in at most Pm. ♦ 

Lemma ShortestRestore - The minimum duration of the Restore state is TDmin·γ. 

Proof - From the definition of the transitory conditions, a node has to remain in the Restore state 
for at least TDmin·γ. It also follows that if no valid Sync messages arrive during the last γ, the node 
will transition to the Maintain state at the end of this time interval. Hence, the minimum 
duration of the Restore state is TDmin·γ. ♦ 

Lemma DeltaSSmin - The minimum time interval between any two consecutive Sync messages 
from a good node is ΔSS,min = TDmin·γ + 1 clock ticks. 

Proof - A node transmits a Sync message when it enters the Restore state. The amount of time 
the node stays in the Maintain state is defined as ΔMR and depicted in Figure 7. From Lemma 
ShortestRestore the minimum duration of the Restore state is TDmin·γ. The time separation 
between any two consecutive Sync messages from a good node is given by ΔSS ≥ TDmin·γ + ΔMR 
clock ticks. Since the message processing time is non-zero, ΔMR ≥ 1 clock tick, and therefore, 
ΔSS,min = TDmin·γ + 1 clock ticks. ♦ 


Figure 7. Shortest Maintain state. 


All good nodes validate a Sync message from a good node if the time interval between 
consecutive messages, i.e., ΔSS,min, is not violated. By Lemma DeltaSSmin, consecutive Sync 
messages from a good node are always more than TDmin apart. Therefore, a message transmitted 
by a good node after ΔSS,min clock ticks from a random start is guaranteed to be valid. If a node is 
in the Restore state, from Lemma MaintainWithinPR, it will transition to the Maintain state within 
Pr. For now, let Pr > 6F. We will determine the minimum value for Pr later in this report. 
Since Pr is larger than ΔSS,min, after Pr from a random start, all Sync messages from a good node 
are at least ΔSS,min apart and meet the timing requirements at the receiving good nodes. 
Therefore, the pre-convergence conditions are defined as: 

1. Time has elapsed for at least Pr from a random start, i.e., t ≥ t0 + Pr. 

2. All Sync messages from the good nodes are valid at the receiving good nodes (Lemma 
DeltaSSmin). 

Thus, for the following lemmas and theorems, the state of the system is considered after 
the pre-convergence conditions are met. At this point, the system is in one of the following three 
states. 

1. None of the good nodes are in the Maintain state 

2. All good nodes are in the Maintain state 

3. Some of the good nodes are in the Maintain state 

The approach for the proof is depicted in Figure 8. The system is shown to converge 
from any state and upon convergence maintain the closure property. The figure is partitioned via 
two dashed lines into three regions. The left region depicts the pre-convergence conditions and 
in conjunction with the middle region they depict the state of the system in the convergence 
process. The right region depicts the system operating in steady state and maintaining the 
synchronization precision. In this figure, the states All, Some, and None represent the three 
possible cases from a random start when the pre-convergence conditions are met. The 
propositions associated with each edge indicate that a transition from one state to another may 
eventually take place. 



Figure 8. Approach for proof. (Regions, left to right: Pre-Convergence, Convergence, Closure.) 

Theorem ConvergeNoneMaintain - A system of K ≥ 3F + 1 nodes satisfying the 
pre-convergence conditions, where none of the good nodes are in the Maintain state and for all good 
nodes, StateTimer(t) < Pr - 3F - 1, will always converge to within an initial precision. 

Proof - Since none of the good nodes are in the Maintain state, they are in the Restore state 
because they have not met the transitory conditions. We consider the system at the point of 
transmission of a Sync message by the last good node in the Maintain state. After the last good 
node transitions to the Restore state, transitioning of the good nodes back to the Maintain state 
can further be delayed only upon receiving valid Sync messages from the faulty nodes. Since 
there are up to F faulty nodes in the system and if their valid Sync messages are γ apart from 
each other (worst case due to transitory condition 2), all good nodes will transition to the 
Maintain state within at most TDmin + F = 3F (assuming F > 0) of the last good node’s Sync 
message. Since all good nodes are in the Restore state, after receiving F valid Sync messages 
from as many faulty nodes, any subsequent Sync messages from the faulty nodes will arrive 
within TDmin and, thus, will be deemed invalid. As a result, for all good nodes, StateTimer(t) < 
Pr - 1 and they will transition to the Maintain state within the next γ without timing out. 


Figure 9. EM and LM for all F > 0. (Labels: N_EM, N_LM, Maintain, Restore, γ, D, EM, LM.)

The earliest a good node transitions to the Maintain state (EM) is at t_EM after it has
remained in the Restore state for the minimum duration of the transitory delay (transitory
condition 1) and one γ after receiving the last Sync message from the last good node that
transitioned to the Restore state (transitory condition 2). Since Δ_SS > TD_min·γ, consecutive valid
Sync messages from a faulty node are more than TD_min apart. Therefore, locally there will
always be a gap of one γ interval without a valid Sync message. As a result, a good node will
meet the transitory conditions and transition to the Maintain state.

As depicted in Figure 9, the earliest the last Sync message can arrive at the N_EM node is
D ticks after the transmission of the last valid Sync message; therefore, EM = D + γ. The latest a
good node transitions to the Maintain state (LM) is at t_LM after remaining in the Restore state for
TD_min + F, i.e., after receiving valid Sync messages from all faulty nodes. In this case, t_EM
happens at the time the last good node transmits the Sync message, i.e., at t_LM since its transition
to the Restore state. So, LM = (TD_min + F)·γ = 3F·γ. Thus, the time difference between N_LM
and N_EM is given by:

StateTimer_LM(t) - StateTimer_EM(t) = Δ_LMEM.

Δ_LMEM = LM - EM = 3F·γ - (D + γ) = (3F - 1)·γ - D.

Therefore, such a system always converges to within the initial precision of Δ_LMEM. ♦

The synchronization precision Δ_Precision is the maximum time difference between the
LocalTimer of any two good nodes when the system is synchronized. It is the guaranteed
precision of the protocol. From theorem ConvergeNoneMaintain, the initial guaranteed
precision after the resynchronization is the maximum value of Δ_LMEM. After the initial synchrony
the LocalTimers of the good nodes will deviate from the initial precision due to the drift rate of
the oscillators. This phenomenon is depicted in Figure 10.
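
As a worked illustration of the initial precision from theorem ConvergeNoneMaintain, the sketch below computes EM, LM, and their difference in clock ticks. The function name is hypothetical, and TD_min = 2F is an assumption inferred from the identity TD_min + F = 3F used in the proof (valid for F > 0).

```python
def initial_precision(F, gamma, D):
    """Return LM - EM in clock ticks for F > 0 faulty nodes.

    gamma: length of one gamma interval in clock ticks
    D:     minimum message delay in clock ticks (D <= gamma)
    """
    if F < 1:
        raise ValueError("this sketch covers F > 0 only")
    td_min = 2 * F                # assumption: TD_min = 2F
    em = D + gamma                # earliest transition to Maintain
    lm = (td_min + F) * gamma     # latest transition to Maintain, 3F * gamma
    return lm - em                # equals (3F - 1) * gamma - D
```

For example, with F = 1 and γ = D = 2 the sketch gives an initial precision of 2 clock ticks.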

Figure 10. The synchronization precision. (Labels: Slow, Fast, Δ_LMEM, Δ_Precision, P.)


The guaranteed synchronization precision Δ_Precision after an elapsed time of P is bounded
by,

Δ_Precision = Δ_LMEM + Δ_Drift,

where, the amount of drift from the initial precision is given by

Δ_Drift = ((1 + ρ) - 1/(1 + ρ))·P·γ,


where, P = P_R + P_M. The factors (1 + ρ) and 1/(1 + ρ) are bounds for the drift of the slowest and
fastest nodes in the system, respectively. Therefore,

Δ_Precision = (3F - 1)·γ - D + Δ_Drift.

Theorem ConvergeAllMaintain - A system of K ≥ 3F + 1 nodes satisfying the pre-convergence
conditions, where all good nodes are in the Maintain state, will always converge.


Proof - Since no assumptions are made about the initial relative timing of the good nodes,
Δ_LocalTimer(t) > Δ_Precision is possible. It follows from the protocol and lemma RestoreWithinPM that
a good node will transition to the Restore state and transmit a Sync message within P_M. A good
node in the Maintain state keeps track of other nodes that have transitioned to the Restore state.
We consider the system after (T_R - 1) good nodes have transitioned to the Restore state. There
are two possible scenarios. In the first scenario, the first (T_R - 1) good nodes that transitioned to
the Restore state remain in the Restore state until the T_R-th good node transitions to the Restore
state. In the second scenario, some (or all) of the first (T_R - 1) good nodes that had transitioned
to the Restore state remain in the Restore state while others (or all) transition back to the
Maintain state before the T_R-th good node transitions to the Restore state. In either case, after the
T_R-th good node transitions to the Restore state, the remaining good nodes in the Maintain state, at
least F, receive T_R valid Sync messages from as many good nodes, and transmit Sync messages
as they transition to the Restore state within the next γ.

At this point, for the first (T_R - 1) good nodes that remained in the Restore state,
StateTimer(t) < (TD_min - 1) + F + T_R = 4F < P_R - F - 1. The longest duration is 4F > TD_min, thus,
such a node has met the first of the transitory conditions. Those good nodes of the first (T_R - 1)
good nodes that had transitioned to the Restore state and then back to the Maintain state will now
receive at least (T_R + 1) Sync messages within 2γ from as many good nodes, will transmit Sync
messages, and transition to the Restore state within the next γ. These nodes are within 2γ of the
recently transitioned good nodes, in particular the T_R-th good node, with StateTimer(t) < 2 < P_R -
3F - 1, and none of them has met the transitory conditions. Therefore, the system consists of all
good nodes in the Restore state with various values for their StateTimers. At one end of the
spectrum, some good nodes have not met the transitory conditions with StateTimer(t) < P_R - 3F -
1. In a similar argument as in theorem ConvergeNoneMaintain, since there are up to F faulty
nodes in the system and if the valid Sync messages are γ apart from each other (worst case due
to transitory condition 2), these good nodes will transition to the Maintain state within at most
TD_min + F = 3F of the last good node’s Sync message without timing out. Therefore, LM =
(TD_min + F)·γ = 3F·γ.


At the other end of the spectrum, some good nodes have met the first of the transitory
conditions with StateTimer(t) < P_R - F - 1. Since there are up to F faulty nodes in the system and
if the valid Sync messages are γ apart from each other, these good nodes will transition to the
Maintain state within the next F of the last good node’s Sync message without timing out.

In a similar argument as in theorem ConvergeNoneMaintain, EM = D + γ and the time
difference between N_LM and N_EM is given by

StateTimer_LM(t) - StateTimer_EM(t) = Δ_LMEM.
Δ_LMEM = LM - EM = 3F·γ - (D + γ) = (3F - 1)·γ - D.

Therefore, such a system always converges to within the initial precision of Δ_LMEM. ♦
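
The StateTimer bound used in the proof of theorem ConvergeAllMaintain can be checked with a short sketch. The helper names are hypothetical, and both TD_min = 2F and T_R = F + 1 are assumptions inferred from the identities TD_min + F = 3F and (TD_min - 1) + F + T_R = 4F that appear in the proofs.

```python
def restore_timer_bound(F):
    """Upper bound on StateTimer for the first (T_R - 1) good nodes that
    stayed in the Restore state: (TD_min - 1) + F + T_R."""
    td_min = 2 * F   # assumption: TD_min = 2F
    t_r = F + 1      # assumption: T_R = F + 1
    return (td_min - 1) + F + t_r


def has_timeout_margin(F, P_R):
    """True when the bound stays below P_R - F - 1, i.e. the node can still
    absorb up to F more gamma intervals without timing out."""
    return restore_timer_bound(F) < P_R - F - 1
```

With these assumptions the bound simplifies to 4F for every F > 0, which is the value quoted in the proof.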

The theorem ConvergeSomeMaintain does not make any assumptions about the initial 
values of the StateTimers of the good nodes. Therefore, its proof encompasses the proof of 
theorem ConvergeNoneMaintain. The proof is more complex and we postpone it until 
subsequent sections where we first address the special cases of F = 1 and F = 0, and then discuss 
the general case of F > 1. In the meantime, we continue the proof with lemmas and theorems 
that apply to the general case of F ≥ 0, 0 ≤ ρ ≪ 1, and d ≥ 0.

Lemma PrecisionLargerThanTDmin - For F > 1, Δ_Precision ≥ TD_min·γ.

Proof -

Δ_Precision ≥ TD_min·γ
(3F - 1)·γ - D + Δ_Drift ≥ 2F·γ
(F - 1)·γ + Δ_Drift ≥ D.

Since γ ≥ D, even if Δ_Drift = 0, the above inequality reduces to
(F - 1)·γ ≥ D, which holds for F > 1 because (F - 1)·γ ≥ γ ≥ D.

Thus, Δ_Precision ≥ TD_min·γ. ♦

Corollary PrecisionTDminF1 - For F = 1, Δ_Precision ≥ TD_min·γ if Δ_Drift ≥ D.

Proof -

Δ_Precision ≥ TD_min·γ
(3F - 1)·γ - D + Δ_Drift ≥ 2F·γ
(F - 1)·γ + Δ_Drift ≥ D
Δ_Drift ≥ D. ♦
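
The lemma and its corollary can be spot-checked numerically. The sketch below is illustrative only: the function names are hypothetical, the drift is passed in as a plain number of clock ticks, and TD_min = 2F is an assumption taken from the 2F·γ term in the derivation.

```python
def precision(F, gamma, D, delta_drift):
    """Synchronization precision in clock ticks: (3F - 1)*gamma - D + drift."""
    return (3 * F - 1) * gamma - D + delta_drift


def reaches_td_min(F, gamma, D, delta_drift):
    """True when the precision is at least TD_min * gamma (TD_min = 2F assumed)."""
    return precision(F, gamma, D, delta_drift) >= 2 * F * gamma
```

For F > 1 the check succeeds even with zero drift, while for F = 1 it succeeds exactly when the drift is at least D.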


It follows from Lemma PrecisionLargerThanTDmin and Corollary PrecisionTDminF1
that depending on the amount of drift Δ_Drift, Δ_Precision can potentially exceed TD_min·γ, i.e.,
Δ_Precision ≥ TD_min·γ. This in turn can result in a disruption in the normal operation of the system. In
particular and as depicted in Figure 10, in a synchronized system of K ≥ 3F + 1 nodes with an
initial precision of Δ_LMEM, after elapse of some time, the nodes can drift apart such that Δ_Precision >
Δ_LMEM. Two such cases are depicted in Figures 11 and 12, where the corresponding activities of
the StateTimer and LocalTimer of N_Fast are also depicted during the resynchronization process.
In Figure 11, the fast nodes, N_Fast, and the slow nodes, N_Slow, are less than Δ_Precision apart, i.e.,
Δ_LocalTimer(t) < Δ_Precision. In Figure 12, N_Fast and N_Slow are Δ_Precision apart from each other, i.e.,
Δ_LocalTimer(t) = Δ_Precision.

If N_Fast consists of at least T_R good nodes, then as these nodes transition to the Restore
state, the remaining good nodes, N_Slow, will follow before timing out as depicted in Figure 11.
On the other hand, as depicted in Figure 12, if N_Fast consists of up to (T_R - 1) good nodes, as they
transition to the Restore state, the remaining good nodes, N_Slow, might not follow. In the
meantime, assuming Δ_Precision ≥ TD_min·γ, in the absence of faulty messages, the N_Fast nodes
will meet the transitory conditions and transition to the Maintain state. As the N_Slow nodes
transition to the Restore state, the N_Fast nodes will follow and once again transition to the Restore
state. It follows from theorem ConvergeNoneMaintain that such a system always converges.
Since a minority of good nodes temporarily diverge but then converge with the rest of the good
nodes, this phenomenon is referred to as momentary-divergence.

During steady state and the resynchronization process, in the absence of a
momentary-divergence, the StateTimer oscillates twice as depicted in Figure 11. However, in the presence of
a momentary-divergence, the StateTimer oscillates 4 times as depicted in Figure 12. Proper
resetting of the LocalTimer during steady state should guarantee that the LocalTimer remains
immune to the StateTimer oscillations.

Figure 11. Activities of N_Fast during the resynchronization process, Δ_LocalTimer(t) < Δ_Precision.

Figure 12. Activities of N_Fast during the resynchronization process, Δ_LocalTimer(t) = Δ_Precision.

Theorem ClosureAllMaintain - A system of K ≥ 3F + 1 nodes, where all good nodes have
converged such that Δ_LocalTimer(t) ≤ Δ_Precision and all are in the Maintain state, shall remain within
the synchronization precision Δ_Precision.

Proof - It follows from the protocol and lemma RestoreWithinPM that a good node will transition
to the Restore state within P_M. Since all good nodes are in the Maintain state, as they transmit
Sync messages, their transitions to the Restore state are recorded by other good nodes. Since the
system is synchronized, the good nodes will transition to the Restore state within Δ_Precision of each
other. The proof proceeds in the following two parts.

If Δ_Precision < TD_min·γ, all good nodes will transition to the Restore state before any of them
transitions back to the Maintain state. In this case, for all good nodes, StateTimer(t) < P_R - 3F -
1, and none of them has met the transitory conditions. It follows from theorem
ConvergeNoneMaintain that such a system always converges to within the initial precision of
Δ_LMEM.


On the other hand, if Δ_Precision ≥ TD_min·γ, it follows from Lemma
PrecisionLargerThanTDmin that some good nodes can potentially transition to the Restore state
and then to the Maintain state before all good nodes transition to the Restore state. In other
words, the system can experience a momentary-divergence. Similar to the proof of theorem
ConvergeAllMaintain, we consider the system after (T_R - 1) good nodes have transitioned to the
Restore state. There are two possible scenarios. In the first scenario, the first (T_R - 1) good
nodes that transitioned to the Restore state remain in the Restore state until the T_R-th good node
transitions to the Restore state. In the second scenario, some (or all) of the first (T_R - 1) good
nodes that had transitioned to the Restore state remain in the Restore state while others (or all)
transition back to the Maintain state before the T_R-th good node transitions to the Restore state. In
either case, after the T_R-th good node transitions to the Restore state, the remaining good nodes in
the Maintain state, at least F, receive T_R valid Sync messages from as many good nodes, and
transmit Sync messages as they transition to the Restore state within the next γ.

At this point, for the first (T_R - 1) good nodes that remained in the Restore state,
StateTimer(t) < (TD_min - 1) + F + T_R = 4F < P_R - F - 1. The longest duration is 4F > TD_min, thus,
such a node has met the first of the transitory conditions. Those good nodes of the first (T_R - 1)
good nodes that had transitioned to the Restore state and then back to the Maintain state will now
receive at least (T_R + 1) Sync messages within 2γ from as many good nodes, will transmit Sync
messages, and transition to the Restore state within the next γ. These nodes are within 2γ of the
recently transitioned good nodes, in particular the T_R-th good node, with StateTimer(t) < 2 < P_R -
3F - 1, and none of them has met the transitory conditions. Therefore, the system consists of all
good nodes in the Restore state with various values for their StateTimers. At one end of the
spectrum, some good nodes have not met the transitory conditions with StateTimer(t) < P_R - 3F -
1. In a similar argument as in theorem ConvergeNoneMaintain, since there are up to F faulty
nodes in the system and if the valid Sync messages are γ apart from each other (worst case due
to transitory condition 2), these good nodes will transition to the Maintain state within at most
TD_min + F = 3F of the last good node’s Sync message without timing out. Therefore, LM =
(TD_min + F)·γ = 3F·γ.

At the other end of the spectrum, some good nodes have met the first of the transitory
conditions with StateTimer(t) < P_R - F - 1. Since there are up to F faulty nodes in the system and
if the valid Sync messages are γ apart from each other, these good nodes will transition to the
Maintain state within the next F of the last good node’s Sync message without timing out.

Once again, in a similar argument as in theorem ConvergeNoneMaintain, EM = D + γ
and the time difference between N_LM and N_EM is given by

StateTimer_LM(t) - StateTimer_EM(t) = Δ_LMEM.
Δ_LMEM = LM - EM = 3F·γ - (D + γ) = (3F - 1)·γ - D.

Therefore, such a system always converges to within the initial precision of Δ_LMEM. ♦

Lemma StateTimerLessThanPrecision - During the resynchronization process, in steady state,
the maximum value of the StateTimer is always less than the synchronization precision Δ_Precision.

Proof - From the protocol, the StateTimer is reset when the node transitions to either the Restore
state or the Maintain state. It follows from the proof of theorem ClosureAllMaintain that during
momentary-divergence some good nodes transition to the Restore state and then back to the
Maintain state before others transition to the Restore state. At time t when the last good node
has transitioned to the Restore state, the value of the StateTimer of earlier nodes that are in the
Maintain state does not exceed Δ_Precision. For these good nodes, StateTimer(t)·γ = Δ_Precision -
TD_min·γ + (D + d).

Since γ ≥ D + d,

StateTimer(t)·γ ≤ Δ_Precision - TD_min·γ + γ
StateTimer(t)·γ ≤ Δ_Precision - (2F - 1)·γ

and for F > 0,

StateTimer(t)·γ < Δ_Precision. ♦

Theorem Congruence - For all good nodes N_i and N_j and for t ≥ C, LocalTimer_i(t) = 0 implies
that N_i and N_j are in the Maintain state.

Proof - From theorem ConvergeNoneMaintain it follows that at the point of convergence when
all good nodes have just transitioned to the Maintain state, the initial precision is Δ_LMEM. It
follows from Lemma StateTimerLessThanPrecision that, during the resynchronization process,
in steady state, even when the system experiences a momentary-divergence, StateTimer(t)·γ <
Δ_Precision for all good nodes. Therefore, in steady state, StateTimer can reach ⌈Δ_Precision⌉ only
when the node has transitioned and remained in the Maintain state. Thus, when StateTimer_EM(t)
= ⌈Δ_Precision⌉, i.e., when N_EM resets its LocalTimer, N_EM, and hence, all good nodes are in the
Maintain state. ♦

Lemma MaxTransitoryDelay - During steady state, the maximum time a good node stays in the
Restore state is given by TD_max = Δ_Precision + (F + 2)·γ.

Proof - Let a good node, N_i, be the first to transition to the Restore state. Let N_i remain in that
state until all other good nodes transition to the Restore state. In a synchronized system, the
maximum duration of this time is Δ_Precision. After the last good node transitions to the Restore
state, N_i receives another valid Sync message and is forced to remain in the Restore state at the
next γ, according to the transitory conditions. At this point, let this node remain in the Restore
state due to receiving additional valid Sync messages from all faulty nodes, one per γ. At the
next γ following the last valid Sync message from the last faulty node, no valid Sync messages
will be received, there will be a gap of one γ interval without a valid Sync message, and the node
will transition to the Maintain state. Therefore, during steady state, the maximum duration of the
transitory delay, TD_max, is given by TD_max = Δ_Precision + γ + F·γ + γ = Δ_Precision + (F + 2)·γ. ♦
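
The terms of the maximum transitory delay can be added up step by step, mirroring the proof. This is a minimal sketch with a hypothetical function name; all quantities are in clock ticks and the precision is supplied by the caller.

```python
def max_transitory_delay(F, gamma, delta_precision):
    """Worst-case time a good node stays in the Restore state during
    steady state, following the steps of the proof above."""
    wait_for_all_good = delta_precision   # until the last good node enters Restore
    held_by_last_good = gamma             # one interval forced by the last good Sync
    held_by_faulty = F * gamma            # one interval per faulty node
    quiet_gap = gamma                     # final interval with no valid Sync
    return wait_for_all_good + held_by_last_good + held_by_faulty + quiet_gap
```

The sum collapses to the precision plus (F + 2) intervals of γ, as stated in the lemma.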

This protocol is intended to be used as the fundamental mechanism in bringing and 
maintaining a system within a known time synchronization precision. In particular, the 
LocalTimer is intended to be used by higher level mechanisms. Therefore, proper management 
of the LocalTimer is one of the guaranteed services provided by the protocol. The logical time 
clock LocalTimer is incremented once every local clock tick and is reset either when it reaches 
its maximum allowed value or when the node has transitioned to the Maintain state and remained 
in that state for ResetLocalTimerAt local clock ticks, where ResetLocalTimerAt is constrained by 
inequality (1) as described in Section 3.3. Therefore, P_M and P_R have to be sufficiently large to
allow time to reset the LocalTimer after the node transitions to the Maintain state. Specifically,
it follows from Figure 12 and Lemma MaxTransitoryDelay that P_R > ⌈(TD_max + Δ_Precision) / γ⌉.

P_R > ⌈(TD_max + Δ_Precision) / γ⌉

P_R > 2⌈Δ_Precision / γ⌉ + (F + 2)

P_R > 2(3F - 1) + 2⌈(Δ_Drift - D) / γ⌉ + (F + 2)

P_R ≥ 7F + 1 + 2⌈(Δ_Drift - D) / γ⌉.     (2)

If 0 ≤ Δ_Drift < D,

P_R ≥ 7F - 1.

If Δ_Drift = D,

P_R ≥ 7F + 1.

If 2D > Δ_Drift > D,

P_R ≥ 7F + 3.

In general, for all F ≥ 0 and K ≥ 3F + 1, and to prevent early timeout, P_R is constrained
by (2). The maximum duration for the Maintain state, P_M, is typically much larger than P_R.
Thus, P_M is constrained by P_M ≥ P_R.
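
Inequality (2) can be evaluated directly for integer parameters. The sketch below is a hedged illustration: the function names are hypothetical, and reading the brackets in (2) as a ceiling is an assumption about the original notation.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers, with b > 0."""
    return -(-a // b)


def min_restore_period(F, gamma, D, delta_drift):
    """Smallest P_R allowed by inequality (2), in units of gamma,
    assuming the brackets in (2) denote a ceiling."""
    return 7 * F + 1 + 2 * ceil_div(delta_drift - D, gamma)
```

With F = 1 and γ = D = 2, a drift equal to D gives 7F + 1 = 8, and a drift of 3 ticks (between D and 2D) gives 7F + 3 = 10.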

Lemma MaxResyncDuration - During steady state, the maximum resynchronization duration,
MRD, is given by MRD = 6F·γ + Δ_Drift - D.

Proof - It follows from the first part of theorem ClosureAllMaintain that the time interval from
when the first good node transitions to the Restore state until all good nodes transition to the
Maintain state is given by the resynchronization duration (RD), RD = Δ_Precision + transmission delay
of last message + LM, where LM = 3F·γ. So,

Δ_Precision + D + LM ≤ RD ≤ Δ_Precision + γ + LM

((3F - 1)·γ - D + Δ_Drift) + D + 3F·γ ≤ RD ≤ ((3F - 1)·γ - D + Δ_Drift) + γ + 3F·γ

(6F - 1)·γ + Δ_Drift ≤ RD ≤ 6F·γ + Δ_Drift - D.

Therefore, the maximum value is

MRD = 6F·γ + Δ_Drift - D. ♦
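
The bounds on the resynchronization duration can be reproduced term by term. This is a sketch with a hypothetical function name; all quantities are in clock ticks, and the message delay is taken to range from D to γ as in the proof.

```python
def resync_duration_bounds(F, gamma, D, delta_drift):
    """Return (lower, upper) bounds on RD = precision + message delay + LM."""
    precision = (3 * F - 1) * gamma - D + delta_drift
    lm = 3 * F * gamma
    lower = precision + D + lm        # fastest delivery of the last message
    upper = precision + gamma + lm    # slowest delivery of the last message
    return lower, upper
```

The upper bound is the MRD of the lemma, 6F·γ plus the drift minus D.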

If the duration of the resynchronization process is small relative to P, the oscillator
drift rate, ρ, does not play a significant role in the resynchronization process. It follows from
Lemma MaxResyncDuration that MRD < P_R for all F > 0 and ρ ≥ 0. Since typically P_M ≫ P_R
and MRD ≪ P, ρ’s omission from the expressions regarding parameters, constants, and
equations in the proof of the resynchronization process is justified.

Theorem LocalTimerWithinPrecision - During steady state, in a system of K ≥ 3F + 1 nodes,
Δ_LocalTimer(t) ≤ Δ_Precision.

Proof - It follows from Lemma StateTimerLessThanPrecision that, during the resynchronization
process, in steady state, even when the system experiences a momentary-divergence, the
StateTimer never reaches ⌈Δ_Precision⌉ and thus the LocalTimer will not be reset during this
process. On the other hand, it follows from theorem ClosureAllMaintain that once synchronized
the good nodes will remain within Δ_Precision of each other. Thus, during steady state,
Δ_LocalTimer(t) ≤ Δ_Precision. ♦

Lemma SyncWithinP - A good node transmits a Sync message within at most (P_R + P_M).

Proof - From lemma MaintainWithinPR, a node in the Restore state will time out within P_R. So,
if a node transitions from the Restore state to the Maintain state before it times out, it had
remained in the Restore state for at most (P_R - 1). From lemma RestoreWithinPM, the node will
time out within P_M. It follows from the protocol that a good node transmits a Sync message upon
entering the Restore state. Therefore, within at most P_R + P_M = P, a good node transmits a Sync
message. ♦

The proof proceeds with the following three parts: F = 1, F = 0, and F > 1. The cases of 
F = 1 and F = 0 are special cases, and the case of F > 1 is the general case. 

5.1. Proof For F = 1 

The remainder of the proof for the case of F = 1 is presented in this section. We first
present the proof for the ideal case where the logical timers of the good nodes are in-phase with
respect to each other and the network imprecision and the oscillator drift are zero, i.e., d = 0 and
Δ_Drift = 0. We then expand the proof to a realizable system in subsequent subsections.

5.1.1. In-Phase Case


In this scenario, the local oscillators and logical timers of all good nodes are assumed to
be in-phase with each other, and the network imprecision and the drift are zero. Since Δ_Drift = 0,
local oscillators and logical timers remain in-phase with each other. This idea is depicted in
Figure 13 where γ = 2.


Figure 13. Ideal case, transitions of logical timers are in-phase. (Labels: Oscillator, Logical Timer 1, Logical Timer 2.)

Theorem ConvergeSomeMaintainF1K4 - A system of F = 1 and K = 3F + 1 = 4 nodes satisfying the
pre-convergence conditions with some of the good nodes in the Maintain state will always
converge.

Proof - The good nodes in the Restore state are there because they have not met the transitory
conditions. There are three possible scenarios for the system.

Case 1 - All good nodes in the Restore state transition to the Maintain state before the nodes in
the Maintain state transition to the Restore state. It follows from theorem ConvergeAllMaintain
that such a system always converges to within Δ_LMEM.

Case 2 - All good nodes in the Maintain state transition to the Restore state at least (3F - 1)·γ
before the nodes in the Restore state transition to the Maintain state. It follows from theorem
ConvergeNoneMaintain that such a system always converges.

Case 3 - Some good nodes transition in and out of the Restore state while others transition in and
out of the Maintain state. This scenario encompasses those that are not covered in case 2 above.
Since there are three good nodes, let’s consider the system where N_1 transitions to the Restore
state while N_2 transitions to the Maintain state. Therefore, N_2 and N_3 receive a valid Sync
message within the next γ. Now, let’s consider the following two sub-cases.

Case 3.1 - N_3 is in the Maintain state. After either N_2 or N_3 transitions to the Restore state, the
other node receives a total of T_R valid Sync messages and transitions to the Restore state. Recall
that receiving a valid Sync message prevents a node from transitioning to the Maintain state
(transitory condition 2). In the meantime, if N_1 had remained in the Restore state due to not
meeting the transitory conditions, then for all good nodes, StateTimer(t) < P_R - 3F - 1, and it
follows from theorem ConvergeNoneMaintain that such a system always converges. If N_1 had
transitioned to the Maintain state before the Sync messages from N_2 and N_3 arrive, N_1 receives T_R
valid Sync messages and returns to the Restore state. Note that the transitory conditions prevent
N_1 from transitioning to the Maintain state while N_2 and N_3 transition to the Restore state. Once
again, at this point and for all good nodes, StateTimer(t) < P_R - 3F - 1, and it follows from
theorem ConvergeNoneMaintain that such a system always converges.


Table 1. Activities of a system of K = 4 nodes and F = 1.

Time           N_1                  N_2                  N_3
t_0+P_R        ...                  ...                  Due to TimeOutRestore()
t_0+P_R+1      3                    1                    1 → 0, Restore
t_0+P_R+2      4                    2                    1
t_0+P_R+3      5                    3 → 0, Maintain      2
t_0+P_R+4      6 → 0, Maintain      1                    3 → 0, Maintain
t_0+P_R+5      1                    2                    1

Case 3.2 - N_3 is in the Restore state. For this scenario, the system is analyzed at about P_R from t_0
on the time axis (see pre-convergence conditions) when N_3 is about to time out. The node
activities are described with the help of the above table.

Table 1 is an execution trace of a system with parameters K = 4, F = 1, with no clock
drift, Δ_Drift = 0. Nodes 1, 2, and 3 are good nodes, and node 4 is the faulty one. The table has
four columns, one for time reference and one for each good node. A row depicts activities of all
good nodes at the corresponding time. Cell contents for the node columns consist of a number
(corresponding to the value of its StateTimer) representing the internal status of a node with the
stored messages as superscripts, and a description of the possible action by the node. Symbol ‘x’
represents a received valid Sync message and symbol ‘-’ represents no valid Sync message
received from the corresponding node. The position of superscripts, 1 thru 4, corresponds to the
source of the message.

At (t_0 + P_R - 2) in the table, N_1 transitions to the Restore state while N_2 transitions to the
Maintain state and N_3 is in the Restore state. There are two possible cases regarding N_3. If N_3
times out within the next γ, it will transition to the Maintain state and receive and retain the Sync
message from N_1. It follows from Case 3.1 above that this system always converges. This is a
trivial case and not shown in the above table.

If N_3 times out after consuming the Sync message from N_1, i.e., at (t_0 + P_R) as listed in the
above table, in order to keep N_1 in the Restore state and prevent the system from synchronizing,
N_1 has to receive a new valid Sync message. If N_1 receives a valid Sync message from another
good node, it will have to be from N_3 and it follows from Case 3.1 above that this system always
converges. So, let N_1 receive a valid Sync message from the faulty node. At the next γ, i.e., at
(t_0 + P_R + 1), N_1 must receive a message from another good node. Let that message come from N_2,
i.e., N_2 transitions to the Restore state due to receiving a message from the faulty node at (t_0 + P_R).
Therefore, at (t_0 + P_R + 1), N_3 and N_1 will have received one new valid Sync message and N_1 will
have to stay in the Restore state for the next γ. To keep N_1 in the Restore state and prevent the
system from synchronizing, N_1 has to receive a new valid Sync message every γ. To meet the
Δ_SS,min timing requirement, the next message has to be from a good node, thus at (t_0 + P_R + 1), N_3
has to transition to the Restore state. In this case all good nodes will be in the Restore state with
StateTimer at two extremes; StateTimer(t) < 3 < (P_R - 1) and N_1 and N_2 have met the first of
transitory conditions, and StateTimer(t) = 0 < (P_R - 3F - 1) and N_3 has not met the first of the
transitory conditions. In a similar argument as in the proof of theorems ConvergeNoneMaintain
and ConvergeAllMaintain these nodes will not time out and such a system always converges to
within the initial precision of Δ_LMEM. Otherwise, at (t_0 + P_R + 1), N_1 and N_2 will transition to the
Maintain state and the system consists of T_R good nodes in the Maintain state. It follows from
Case 3.1 above that this system always converges. ♦

Theorem ConvergeSomeMaintainF1KGT4 - A system of F = 1 and K > 3F + 1 nodes, satisfying
the pre-convergence conditions with some good nodes in the Maintain state, will always
converge.

This case is a generalization of theorem ConvergeSomeMaintainF1K4 for K > 4. In a
similar argument, such a system always converges. We do not provide the details of the proof of
this theorem here but would like to point out that the system consists of three sets of good nodes,
S_1, S_2, and S_3, where K_i = |S_i|, i = 1, 2, 3. Since K > 3F + 1, G > 2F + 1 and at least one set, S_i,
has K_i ≥ T_R. In other words, the presence of the additional good nodes expedites the
convergence process.
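
The counting argument above can be checked by brute force for small systems. In this sketch the function name is hypothetical, G stands for the number of good nodes, and T_R = F + 1 = 2 for F = 1 is an assumption inferred from the identities used in the earlier proofs.

```python
from itertools import product


def some_set_reaches_threshold(G, t_r, sets=3):
    """True if every way of splitting G good nodes into `sets` sets
    (empty sets allowed) leaves at least one set with t_r or more nodes."""
    for assignment in product(range(sets), repeat=G):
        sizes = [assignment.count(s) for s in range(sets)]
        if max(sizes) < t_r:
            return False
    return True
```

With T_R = 2, four or more good nodes always produce a set of size at least T_R, whereas three good nodes can be split one per set.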

Theorem ConvergeSomeMaintainF1 - A system of F = 1 and K ≥ 3F + 1 nodes, satisfying the
pre-convergence conditions with some good nodes in the Maintain state will always converge.

Proof - It follows from theorems ConvergeSomeMaintainF1K4 and
ConvergeSomeMaintainF1KGT4 that such a system always converges. ♦

Theorem StabilizeF1 - A system of F = 1 and K ≥ 3F + 1 nodes self-stabilizes from any
random state after a finite amount of time.

Proof - The proof of this theorem consists of proving the convergence, closure, and congruence
properties as defined in section 3.6. The approach for the proof is to show that a system of K ≥
3F + 1 nodes converges from any condition to a state where all good nodes are in the Restore
state and then synchronously transition to the Maintain state within a guaranteed initial precision.
The system is then shown to remain within the timing bounds of the synchronization precision of
Δ_Precision. This idea is depicted in Figure 6.

The approach for the proof is depicted in Figure 8. The system is shown to converge 
from any state and upon convergence maintain the closure property. 

Convergence Property - Δ_LocalTimer(C) ≤ Δ_Precision.

The proof is done in the following three cases.

Case 1 - None of the good nodes are in the Maintain state.

For all good nodes, if StateTimer(t) < P_R - 3F - 1, it follows from theorem
ConvergeNoneMaintain that such a system always converges. Otherwise, it follows from
theorem ConvergeSomeMaintainF1 that such a system always converges.

Case 2 - All good nodes are in the Maintain state.

It follows from theorem ConvergeAllMaintain that such a system always converges.

Case 3 - Some of the good nodes are in the Maintain state.

It follows from theorem ConvergeSomeMaintainF1 that such a system always converges.

Closure Property - ∀ t ≥ C, Δ_LocalTimer(t) ≤ Δ_Precision.

It follows from theorems ClosureAllMaintain and LocalTimerWithinPrecision that upon
convergence, such a system always remains stabilized and Δ_LocalTimer(t) ≤ Δ_Precision for t ≥ C.

Congruence Property - ∀ good nodes N_i and N_j, ∀ t ≥ C, LocalTimer_i(t) = 0 → N_i and N_j in the
Maintain state.

It follows from theorem Congruence that upon convergence, this property is satisfied.

Therefore, such a system always self-stabilizes. ♦

Since this protocol self-stabilizes from any state, initialization and/or reintegration are not 
treated as special cases. Therefore, a reintegrating node will always be admitted to participate in 
the self-stabilization process as soon as it becomes active. 

Theorem ConvergeTime - A system of K ≥ 3F + 1 nodes and F ≤ 1 converges from any random 
state to a stabilized state within C = (2P_R + P_M)·γ. 

Proof - In order for the system to stabilize, all good nodes must undergo the resynchronization 
process. It follows from Lemma SyncWithinP that a good node initiates this process by 
transmitting a Sync message within at most P. It follows from theorem StabilizeF1 that the 
system always converges. Also, it follows from that theorem and Lemma MaxResyncDuration 
that the system converges and all good nodes will transition to the Maintain state at the end of 
the resynchronization process and within the next P_R. Therefore, the system converges within at 
most ((P_R + P_M) + P_R)·γ and C = (2P_R + P_M)·γ. ♦ 

Since P_Actual < P and typically P_M ≫ P_R, the maximum convergence time, C, can be 
approximated as C ≈ P. Therefore, C is a linear function of P and, similarly, of P_M. 
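
The bound and its approximation can be checked numerically. The sketch below is a minimal illustration, assuming that P_R and P_M are counted in units of γ and that P corresponds to (P_R + P_M)·γ clock ticks, which is how the proof above appears to use it; the function names and the sample parameter values are hypothetical.

```python
def convergence_bound(p_r: int, p_m: int, gamma: int) -> int:
    """Upper bound C = (2*P_R + P_M) * gamma, in clock ticks."""
    return (2 * p_r + p_m) * gamma


def period_ticks(p_r: int, p_m: int, gamma: int) -> int:
    """Assumed reading of the proof: P corresponds to (P_R + P_M) * gamma ticks."""
    return (p_r + p_m) * gamma


# As P_M grows relative to P_R, the ratio C / P approaches 1.
for p_m in (100, 1_000, 10_000):
    c = convergence_bound(p_r=4, p_m=p_m, gamma=2)
    p = period_ticks(p_r=4, p_m=p_m, gamma=2)
    print(p_m, c, round(c / p, 4))
```

Each unit increase in P_M raises the bound by exactly γ ticks, which is the linear dependence stated above.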

5.1.2. Out-of-Phase Case 

The out-of-phase scenario is defined as a system where the logical timers of the good 
nodes are out of phase with each other, but the local oscillators are in-phase, and the network 
imprecision and the oscillator drift are also zero, i.e., d = 0 and Δ_Drift = 0. In this scenario, since 
Δ_Drift = 0, the local oscillators of all good nodes remain in-phase with each other. This idea is 
depicted in the following figure where γ = 2. 

[Figure 14 image: oscillator trace with Logical Timer 1 (values 2, 3) and Logical Timer 2 (values 7, 8).] 

Figure 14. Ideal case, transitions of logical timers are out-of-phase. 

In this scenario, d = 0 and Δ_Drift = 0; therefore, γ = D. We do not provide a 
paper-and-pencil proof of the out-of-phase scenario here. The proof follows the same line of 
reasoning as above. 

Systems of K = 4 nodes for γ = 1, 2, 3, and 4 were model checked and proven to 
self-stabilize in the presence of one arbitrary faulty node, as expected. Details of the model 
checking effort for this scenario will be the subject of a subsequent report. 

5.1.3. A Realizable System 

A realizable system is defined as a system where the logical timers and the local 
oscillators of the good nodes have unconstrained relative phases, and the network imprecision 
and the oscillator drift are not constrained to be zero, i.e., d ≥ 0 and Δ_Drift ≥ 0. In this scenario, no 
assumptions are made about the relative phase difference of local oscillators and logical timers of 
the good nodes. The local oscillators and logical timers of all good nodes may drift apart with 
respect to each other. Here, we constrain a realizable system such that 0 ≤ d ≤ 1. For such a 
system we also constrain D ≥ 1. Therefore, γ ≥ 2. A paper-and-pencil proof for this scenario is 
more complex and is left for future work. 

Nevertheless, we focus our attention here on the model checking results and report on the 
issues associated with mapping this protocol to a real system. Since for model checking 
purposes d is treated as an integer [Mai 2007, 2008], its value is randomly selected to be either 0 
or 1 for a given transmission of a Sync message. Several such systems with d ∈ {0, 1}, D = 1, 2, 
or 3, and γ = 2, 3, or 4, respectively, were model checked and proven to self-stabilize in the 
presence of one arbitrary faulty node. Details of the model checking of this scenario will be the 
subject of a subsequent report. The results reveal that for such a realizable system to 
self-stabilize two additional good nodes are needed. As shown in the following counterexample, a 
system of K = 4, d ∈ {0, 1}, D = 1, and γ = 2 does not self-stabilize. 

Table 2. Counterexample for a system of K = 4 nodes and F = 1. 

Time      Node 1                    Node 2                    Node 3
t + 0     1-xx → 0-xx, Restore      4-x-                      1
t + 1     0-xx                      4 → 0, Maintain           1
t + 2     1                         0x-                       2
t + 3     1                         1x-                       2
t + 4     2 → 0, Maintain           1x-x → 0x-x, Restore      3-x
t + 5     0-x-                      0x-x                      3-x-x
t + 6     1-x-x → 0-x-x, Restore    1                         4-x-
t + 7     0-x-x                     1                         4 → 0, Maintain
t + 8     1                         2                         0x-
t + 9     1                         2x-x                      1x-
t + 10    2 → 0, Maintain           3-x                       1x-x → 0x-x, Restore
t + 11    0-x-                      3-x-                      0x-x
t + 12    1-xx → 0-xx, Restore      4-x-                      1

There are two solutions for the above system: either the Byzantine-faulty node is 
restricted to influence the nodes at greater intervals, or additional good nodes are added to the 
system. 


We define three types of Byzantine faulty behaviors here. Recall that Δ_SS,min = TD_min·γ + 
1 clock ticks and, for F = 1, Δ_SS,min = 2γ + 1. 

• A type-A Byzantine faulty node transmits Sync messages arbitrarily but at intervals 
greater than or equal to Δ_SS,min, measured separately at each receiving good node. Note 
that this type is a redefinition of the Byzantine faulty node behaving arbitrarily at every 
clock tick but is tailored for this protocol. 

• A type-B Byzantine faulty node transmits Sync messages arbitrarily but at intervals 
greater than or equal to 3γ + 1, measured separately at each receiving good node. 

• A type-C Byzantine faulty node transmits Sync messages arbitrarily but at intervals 
greater than or equal to 4γ + 1, measured separately at each receiving good node. 
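
The three fault types differ only in the minimum spacing of Sync messages seen at a receiving good node. The following sketch, in which the function names and the sample arrival times are hypothetical, computes that spacing for a given γ and tests whether a recorded sequence of arrival ticks at one receiver is consistent with a given type. It assumes F = 1, so that type A uses Δ_SS,min = 2γ + 1.

```python
from typing import Sequence

# Multiplier of gamma for each fault type, as listed above (type A uses
# Delta_SS,min = 2*gamma + 1, the value quoted for F = 1).
_MULTIPLIER = {"A": 2, "B": 3, "C": 4}


def min_sync_interval(fault_type: str, gamma: int) -> int:
    """Minimum number of clock ticks between Sync messages for the given type."""
    return _MULTIPLIER[fault_type] * gamma + 1


def respects_type(arrival_ticks: Sequence[int], fault_type: str, gamma: int) -> bool:
    """True if consecutive Sync arrivals at one receiver are spaced widely enough."""
    limit = min_sync_interval(fault_type, gamma)
    return all(later - earlier >= limit
               for earlier, later in zip(arrival_ticks, arrival_ticks[1:]))


arrivals = [0, 5, 12]  # hypothetical arrival ticks at a single good node
print(respects_type(arrivals, "A", gamma=2))  # True: gaps 5 and 7, limit 5
print(respects_type(arrivals, "B", gamma=2))  # False: gap 5 is below the limit 7
```

Because the interval is measured separately at each receiving node, the check would be applied per receiver rather than to the faulty node's overall transmission history.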

Model checking results indicate that the system of K = 4 nodes described above 
self-stabilizes in the presence of a type-C Byzantine faulty node. Alternatively, after the addition of 
another good node, a system of K = 5 nodes self-stabilizes in the presence of a type-B Byzantine 
faulty node. The following table is a counterexample for a system of K = 5 nodes in the 
presence of a type-A Byzantine faulty node. 

Table 3. Counterexample for a system of K = 5 nodes and F = 1. 

Time      Node 1                  Node 2                  Node 3                  Node 4
t + 0     1-xx → 0-xx, Restore    4 → 0, Maintain         3--x                    1
t + 1     0-xx                    0x--                    3x-x                    2
t + 2     1                       1x-                     4x-                     2
t + 3     1                       1x-x → 0x-x, Restore    4 → 0, Maintain         3--x
t + 4     2-x-                    0x-x                    0-x-                    3-x-x
t + 5     2-x-                    1                       1x-                     4-x-
t + 6     3--x                    1                       1x-x → 0x-x, Restore    4 → 0, Maintain
t + 7     3-x-x                   2-x-                    0-x-x                   0-x-
t + 8     4-x-                    2-x-                    1                       1-x-
t + 9     4 → 0, Maintain         3--x                    1                       1-x-x → 0-x-x, Restore
t + 10    0-x-                    3-xx                    2-x-                    0-x-x
t + 11    1-x-                    4-x-                    2-x-                    1
t + 12    1-xx → 0-xx, Restore    4 → 0, Maintain         3--x                    1

After the addition of another good node, a system of K = 6 nodes self-stabilizes in the 
presence of a type-A Byzantine faulty node. The following table is a summary of the conditions 
that a system in this scenario requires to self-stabilize in the presence of a Byzantine-faulty node. 

Table 4. A realizable system with 0 ≤ d ≤ 1 and F = 1. 

K     Byzantine Node Δ_SS,min Intervals
4     4γ + 1
5     3γ + 1
6     2γ + 1

5.2. Proof For F = 0 

The proof for the case of F = 1 readily applies to the special case of F = 0 and K > 2. We 
present the proof of this special case separately. 

Theorem ConvergeNoneMaintainF0 - A system of K > 2 nodes satisfying the pre-convergence 
conditions with none of the good nodes in the Maintain state and, for all good nodes, 
StateTimer(t) < P_R - 2, will always converge to within an initial precision. 

Proof - The proof is similar to the proof of the general case as presented in theorem 
ConvergeNoneMaintain. Since none of the good nodes are in the Maintain state, they are in the 
Restore state because they have not met the transitory conditions. We consider the system at the 
point of transmission of a Sync message by the last good node in the Maintain state, where for all 
nodes StateTimer(t) < P_R - 2. The nodes will receive one last valid Sync message, will remain in 
the Restore state for another γ and, since there are no faulty nodes present, all nodes will 
transition to the Maintain state within the following γ. Thus, they will not time out while in the 
Restore state. 
Similar to the argument in the proof of theorem ConvergeNoneMaintain, the earliest a 
good node transitions to the Maintain state is at t_EM, where EM = D + γ. The latest a good node 
transitions to the Maintain state is at t_LM, after remaining in the Restore state for TD_min = 2, 
i.e., LM = 2γ. So, the time difference between N_LM and N_EM is given by 

StateTimer_EM(t) - StateTimer_LM(t) = Δ_LMEM, 

Δ_LMEM = LM - EM = 2γ - (D + γ) = γ - D = d. 

Therefore, such a system always converges to within the initial precision of Δ_LMEM. ♦ 
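
The arithmetic in this proof can be reproduced with a few lines of Python. This is a minimal sketch that assumes γ = D + d, the relation implied by the last equality above; the function and parameter names are hypothetical.

```python
def initial_precision_f0(delay: int, imprecision: int, td_min: int = 2) -> int:
    """Delta_LMEM = LM - EM for F = 0, assuming gamma = D + d."""
    gamma = delay + imprecision   # assumed relation between gamma, D and d
    em = delay + gamma            # EM = D + gamma
    lm = td_min * gamma           # LM = TD_min * gamma = 2 * gamma
    return lm - em


# For each (D, d) pair the result equals d, the network imprecision.
for delay, imprecision in [(1, 0), (1, 1), (3, 1)]:
    print(delay, imprecision, initial_precision_f0(delay, imprecision))
```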

Theorem ConvergeF0 - A system of F = 0 and K > 2 good nodes will always converge. 

Proof - The proof proceeds in the following three cases. 

Case 1 - All nodes are in the Maintain state. Since there are no faulty nodes present, T_R = 1. 
Therefore, as soon as one of the nodes transitions to the Restore state, all others will follow 
within the next γ. All good nodes will transition to the Restore state within γ of each other. At 
this point, for all good nodes, StateTimer(t) < 2 ≪ P_R - 2 and none of them has met the 
transitory conditions. Since there are no faulty nodes present, the nodes will transition to the 
Maintain state within the next γ and thus will not time out while in the Restore state. It follows 
from theorem ConvergeNoneMaintainF0 that such a system always converges to within Δ_LMEM. 

Case 2 - All nodes are in the Restore state. Since in this case no other assumptions are made, 
some nodes transition to the Maintain state due to timeouts while others by meeting the 
transitory conditions. Nevertheless, all nodes will transition to the Maintain state within TD_min·γ 
of each other. As a result, the initial precision is TD_min·γ. It follows from case 1 that this system 
will be within the initial precision of Δ_LMEM within the next synchronization round. 

Case 3 - Some nodes are in the Maintain state while others are in the Restore state. If the nodes 
in the Restore state transition to the Maintain state before the nodes in the Maintain state time 
out, then the system will consist of all nodes in the Maintain state. It follows from case 1 that 
such a system always converges. Conversely, if the nodes in the Maintain state transition to the 
Restore state before the nodes in the Restore state time out, then the system will consist of all 
nodes in the Restore state. It follows from case 2 that such a system always converges. If some 
nodes transition to the Maintain state due to time out, while at least one other node transitions to 
the Restore state, then since T_R = 1, all nodes that have transitioned to the Maintain state will 
transition back to the Restore state within the next γ. It follows from case 2 that such a system 
always converges. Therefore, a system of F = 0 and K > 2 good nodes will always converge to 
within Δ_LMEM. 

From theorem ConvergeNoneMaintainF0, the initial guaranteed precision after the 
resynchronization is the maximum value of Δ_LMEM. For this case, since F = 0, TD_min = 2 and 

StateTimer_EM(t) - StateTimer_LM(t) = Δ_LMEM, 

Δ_LMEM = LM - EM = (TD_min + F)·γ - (D + γ) = 2γ - (D + γ) = γ - D = d. 

Therefore, the system converges to within at most (γ - D) = d of each other. ♦ 

The guaranteed synchronization precision Δ_Precision after an elapsed time of P is bounded by 

Δ_Precision = Δ_LMEM + Δ_Drift. 

For F = 0, 

Δ_Precision = Δ_LMEM + Δ_Drift = d + Δ_Drift. 

Corollary PrecisionTDminF0 - For F = 0, Δ_Precision ≥ TD_min·γ if Δ_Drift ≥ D + γ. 

Proof - 

Δ_Precision ≥ TD_min·γ 

(γ - D) + Δ_Drift ≥ 2γ 

Δ_Drift ≥ D + γ ♦ 
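
The corollary can be sanity-checked by brute force over small integer values. The sketch below assumes γ = D + d and non-strict inequalities; the function names and the tested ranges are choices made for this example only.

```python
def precision_f0(imprecision: int, drift: int) -> int:
    """Delta_Precision = Delta_LMEM + Delta_Drift, with Delta_LMEM = d for F = 0."""
    return imprecision + drift


def corollary_agrees(delay: int, imprecision: int, drift: int, td_min: int = 2) -> bool:
    """Check that both sides of the corollary give the same verdict."""
    gamma = delay + imprecision                        # assumed: gamma = D + d
    exceeds_window = precision_f0(imprecision, drift) >= td_min * gamma
    drift_condition = drift >= delay + gamma
    return exceeds_window == drift_condition


# Exhaustive check over a small, arbitrary range of D, d, and Delta_Drift.
assert all(corollary_agrees(D, d, drift)
           for D in range(1, 5) for d in (0, 1) for drift in range(0, 20))
```

Under these assumptions the two conditions are in fact equivalent, since both reduce to Δ_Drift ≥ 2D + d.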

Theorem ClosureAllMaintainF0 - A system of F = 0 and K > 2 good nodes, where all nodes 
have converged such that all nodes are in the Maintain state and Δ_LocalTimer(t) ≤ Δ_Precision, shall 
remain within the synchronization precision Δ_Precision. 

Proof - Since all good nodes are in the Maintain state, it follows from lemma RestoreWithinPM 
that upon timeout, the nodes will transmit Sync messages and transition to the Restore state 
within P_M. As they transmit Sync messages, their transitions to the Restore state are recorded by 
other good nodes that are in the Maintain state. Furthermore, since T_R = 1, as soon as one node 
transitions to the Restore state, the other nodes will transition to the Restore state within the next 
γ. At this point, for all good nodes, StateTimer(t) < 2 ≪ P_R - 2 and none of them has met the 
transitory conditions. Since there are no faulty nodes present, the nodes will transition to the 
Maintain state within the next γ and, thus, will not time out while in the Restore state. It follows 
from theorem ConvergeNoneMaintainF0 that such a system always converges to within Δ_LMEM. ♦ 

Theorem StabilizeF0 - A system of F = 0 and K > 2 nodes self-stabilizes from any random state 
after a finite amount of time. 

Proof - It follows from theorem ConvergeF0 that the nodes converge and upon convergence 
transition to the Maintain state within Δ_LMEM of each other. It follows from theorem 
ClosureAllMaintainF0 that such a system of F = 0 and K > 2 nodes always remains within the 
Δ_Precision bounds. Thus, Δ_LocalTimer(t) ≤ Δ_Precision. It follows from theorem Congruence that upon 
convergence, the congruence property is satisfied. Therefore, such a system always self-stabilizes. ♦ 

5.3. Generalization Of The Protocol, For F > 1 

It follows from theorems StabilizeF0, ConvergeNoneMaintain, ConvergeAllMaintain, 
and ClosureAllMaintain, and their corresponding assumptions, that a system under their 
associated conditions always self-stabilizes for all F ≥ 0 and 0 ≤ ρ ≪ 1. It can readily be shown 
that in the absence of faulty nodes, this protocol always converges from any arbitrary state. 
Also, if the faults are transient such that the time interval between the consecutive manifestations 
of the transient faults is greater than the convergence time C, the system always self-stabilizes 
for all F ≥ 0. Nevertheless, since theorem ConvergeSomeMaintainF1 cannot be generalized, this 
protocol does not seem to solve the general case of clock synchronization for F > 1. Table 5 is a 
trace of a counterexample for a system with K = 8 nodes, where F = 2 and G = 6. 

Let's consider the system where some good nodes transition in and out of the Restore 
state while others transition in and out of the Maintain state. Also, let the system consist of three 
sets of good nodes, S_1, S_2, and S_3, where K_i = |S_i|, i = 1, 2, 3. For simplicity, let's assume that all 
good nodes in a set S_i are in synchrony with each other such that they all transition from one state 
to another at the same time. Now let's consider K_1 = K_2 = K_3 = F < T_R. The following table 
depicts a scenario that repeats indefinitely and reveals that such a system will not always 
converge. 


Table 5. Activities of a system of K ≤ 4F nodes, F > 1. 

Time      S1                      S2                      S3
t + 0     2-x                     2x-                     5 → 0, Maintain
t + 1     3-x                     3x-x → 0x-x, Restore    1
t + 2     4-x-                    1-x-                    2-x-
t + 3     5 → 0, Maintain         2-x                     3-x-x → 0-x-x, Restore
t + 4     1-x-                    3-x-                    1-x-
t + 5     2-x-                    4-x                     2-x
t + 6     3-xx → 0, Restore       5 → 0, Maintain         3-x
t + 7     1                                               4x-
t + 8     2-x                     2x-                     5 → 0, Maintain

The scenario that repeats consists of a set transitioning to the Restore state at the same time 
another set transitions to the Maintain state while the third set is in the Restore state. For 
instance, at t + 6, S_1 transitions to the Restore state, S_2 transitions to the Maintain state, and S_3 
remains in the Restore state. Therefore, during the next γ the sets receive up to F valid Sync 
messages. The set S_3 is forced to remain in the Restore state while S_2 has up to F valid Sync 
messages. 

Although this protocol does not solve the general case of this problem, it provides 
mathematically proven and mechanically verified [Mai 2007, 2008] partial solutions for specific 
cases of F = 0 and F = 1. We intend to use these specific cases as the building blocks for larger 
and more complex systems. 


6. Protocol Overhead 

Since only one message, namely Sync, is required for the operation of this protocol, the 
protocol overhead during steady state is at most (depending on the amount of Δ_Drift) two 
messages per P. Also, since only one message is needed, a single binary value is sufficient to 
represent it. 
7. Possible Applications 

The proposed self-stabilizing protocol is expected to have many practical applications as 
well as many theoretical implications. Embedded systems, distributed process control, 
synchronization, fault tolerance with Byzantine agreement, computer networks, the Internet, 
Internet applications, security, safety, automotive, aircraft, wired and wireless 
telecommunications, graph theoretic problems, leader election, time division multiple access 
(TDMA), and the SPIDER* architecture [Tor 2005A, 2005B] at NASA-LaRC are a few 
examples. These are some of the many areas of distributed systems that can use self-stabilization 
in order to design more robust distributed systems. 


8. Conclusions 

The self-stabilization problem has two facets. It is inherently event-driven and it is also 
time-driven. Most attempts at solving the self-stabilization problems have focused only on the 
event-driven aspect of this problem. Additionally, all efforts toward solving this problem must 
recognize that the system undergoes two distinct phases, un-stabilized and stabilized, and that 
once stabilized, the system state needs to be preserved. The protocol presented here properly 
merges the time-driven and event-driven aspects of this problem in order to self-stabilize the 
system in a timely manner. Initialization and/or reintegration are not treated as special cases. 
These scenarios are regarded as an inherent part of this self-stabilizing protocol. 

In this report, a rapid Byzantine-fault-tolerant self-stabilizing clock synchronization 
protocol is presented. The protocol presented here is independent of specific application- 
dependent requirements and is focused only on clock synchronization of a system in the presence 
of Byzantine faults and after the cause of transient faults has dissipated. The protocol utilizes a 
single message, Sync, and during steady state imposes an overhead of at most two messages per 
synchronization period. A model of this protocol has been mechanically verified using SMV 
[SMV] where the entire state space has been examined and proven to self-stabilize in the 
presence of one arbitrary faulty node. Instances of the protocol have been proven to tolerate 
bursts of transient failures and deterministically converge in linear time with respect to the 
synchronization period, as predicted. This protocol does not rely on any assumptions about the 
initial state of the system except for the presence of a sufficient number of good nodes, and no 
assumptions are made about the internal status of the nodes, the monitors, and the 
communication channels, thus making the weakest assumptions and producing the strongest 
results. All timing measures of variables are based on the node’s local clock and thus no central 
clock or externally generated pulse is used. The Byzantine faulty behavior modeled here is a 
node with arbitrarily malicious behavior. The Byzantine faulty node is allowed to influence 
other nodes at every clock tick. The only constraint is that the interactions are restricted to 
defined interfaces. 

Proofs of specific instances of this protocol are presented in this report. This protocol has 
been the subject of a rigorous verification effort. A system tolerating one Byzantine faulty node 
has been model checked for the in-phase, out-of-phase, and realizable systems. The SMV model 
checking results verified the correctness of the claims of this self-stabilizing protocol. 

* Scalable Processor-Independent Design for Enhanced Reliability (SPIDER). 

Although this protocol does not solve the general case of the problem, it provides proven 
and verified solutions for specific cases. The paper-and-pencil proofs presented here, in 
conjunction with the model checking results, indicate that the protocol is applicable to realizable 
practical systems. We intend to leverage the specific cases as building blocks for larger and 
more complex systems. 

This protocol is intended to be the fundamental mechanism for bringing and maintaining 
a system within bounded synchrony. Formalization and verification of the integration process of 
other protocols with this protocol in order to achieve tighter precision are underway. 
Nevertheless, proper means are embedded in this protocol to accommodate the integration 
process. Implementation of this protocol in hardware and its characterization in a representative 
adverse environment are being planned. 
Appendix A. Symbols 


This appendix lists the symbols used in the protocol. 


Symbol             Description
ρ                  bounded drift rate with respect to real time
d                  network imprecision
D                  event-response delay
F                  maximum number of faulty nodes
G                  minimum number of good nodes
K                  sum of all nodes
Sync               self-stabilization message
S                  abbreviation for Sync message
Δ_SS               time difference between the last consecutive Sync messages
T_R                threshold for Retry() function
Restore            self-stabilization state
Maintain           self-stabilization state
R                  abbreviation for Restore state
M                  abbreviation for Maintain state
P_R                maximum duration while in the Restore state
P_R,min            minimum value of P_R
P_M                maximum duration while in the Maintain state
P_Actual           actual synchronization period
P                  synchronization period
γ                  equally spaced time intervals for time-driven activities
C                  maximum convergence time
Δ_LocalTimer(t)    maximum time difference of LocalTimers of any two good nodes at real time t
LM                 Latest Maintain
EM                 Earliest Maintain
Δ_LMEM             difference of LM and EM, a.k.a. initial synchronization precision
Δ_Precision        maximum synchronization precision
Δ_Drift            maximum deviation from the initial synchrony
N_i                the i-th node
M_i                the i-th monitor of a node






REPORT DOCUMENTATION PAGE 




1. REPORT DATE (DD-MM-YYYY): 01-06-2009 
2. REPORT TYPE: Technical Memorandum 
3. DATES COVERED (From - To): 
4. TITLE AND SUBTITLE: A Self-Stabilizing Byzantine-Fault-Tolerant Clock Synchronization Protocol 
5a. CONTRACT NUMBER: 
5b. GRANT NUMBER: 
5c. PROGRAM ELEMENT NUMBER: 
5d. PROJECT NUMBER: 
5e. TASK NUMBER: 
5f. WORK UNIT NUMBER: 645846.02.07.07.15.02 
6. AUTHOR(S): Malekpour, Mahyar R. 
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): NASA Langley Research Center, Hampton, VA 23681-2199 
8. PERFORMING ORGANIZATION REPORT NUMBER: L-19568 
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Washington, DC 20546-0001 
10. SPONSOR/MONITOR'S ACRONYM(S): NASA 
11. SPONSOR/MONITOR'S REPORT NUMBER(S): NASA/TM-2009-215758 
12. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified - Unlimited; Subject Category 62; Availability: NASA CASI (443) 757-5802 
13. SUPPLEMENTARY NOTES: 
14. ABSTRACT: This report presents a rapid Byzantine-fault-tolerant self-stabilizing clock synchronization protocol that is independent of 
application-specific requirements. It is focused on clock synchronization of a system in the presence of Byzantine faults after the cause of 
any transient faults has dissipated. A model of this protocol is mechanically verified using the Symbolic Model Verifier (SMV) [SMV] 
where the entire state space is examined and proven to self-stabilize in the presence of one arbitrary faulty node. Instances of the protocol are 
proven to tolerate bursts of transient failures and deterministically converge with a linear convergence time with respect to the 
synchronization period. This protocol does not rely on assumptions about the initial state of the system other than the presence of a sufficient 
number of good nodes. All timing measures of variables are based on the node's local clock, and no central clock or externally generated 
pulse is used. The Byzantine faulty behavior modeled here is a node with arbitrarily malicious behavior that is allowed to influence other 
nodes at every clock tick. The only constraint is that the interactions are restricted to defined interfaces. 
15. SUBJECT TERMS: Byzantine-Fault-Tolerant; Clock Synchronization; Formal Verification; Model Checking; Protocol; Self-Stabilization 
16. SECURITY CLASSIFICATION OF: a. REPORT: U; b. ABSTRACT: U; c. THIS PAGE: U 
17. LIMITATION OF ABSTRACT: UU 
18. NUMBER OF PAGES: 43 
19a. NAME OF RESPONSIBLE PERSON: STI Help Desk (email: help@sti.nasa.gov) 
19b. TELEPHONE NUMBER (Include area code): (443) 757-5802 

Standard Form 298 (Rev. 8-98) 
Prescribed by ANSI Std. Z39.18 